Abstract

This technical report introduces a system developed by Auburn University and
Clemson University. The system securely maintains consistent data provenance by
using cryptocurrency primitives. This novel provenance approach comprises
a reusable distributed service. It achieves scalability by using distributed services to
maintain ledger information. The system leverages Bitcoin, a successful distributed
cryptocurrency; it does not require the Bitcoin proof of work, but
leverages the blockchain infrastructure to maintain provenance metadata securely. We
also leverage existing tools from Indiana University (Karma, Komadu) for provenance
data exploration and visualization. Our approach allows signing by both server/system
and user to create dual information about possession, while distributed ledgers remove
control and maintenance of metadata from the user who creates it.
This work is timely because of the need for higher-assurance systems.


1 Introduction 

Auburn University and Clemson University are developing, implementing and testing 
a prototype (an API, library, and distributed service) for securely maintaining consistent 
data provenance meta-information. Outcomes of this work will support both scientific 
workflows and cybersecurity assurance functions. This system builds on open-source Bit- 
coin infrastructure [12] to create distributed ledgers that maintain distributed provenance 
data outside of user control. Once information is enrolled on the ledgers, strong security 
guarantees make the forging and manipulation of provenance metadata impractical. 

In cloud computing, where proof of past data possession is difficult to obtain and cur- 
rent/past data location is hard to know, our system could efficiently maintain a reliable 
audit trail. Certifying chain of custody for files is at once both difficult and important. 
Custody information supports forensics and situations where multi-organization data 
sharing can result in a lack of trust in data quality. Providing secure data provenance 
information for cloud computing would enable greater trust in cloud computing by en- 
abling better digital forensics. By allowing increased data sharing for law enforcement, 
without fear of corruption or manipulation, criminal prosecution of online crime will be- 
come more likely. 

For NSF, a new emphasis on reproducibility of scientific results means that workflow 
systems and opt-in (voluntary/ planned) systems for documenting corpora of work need 
effective provenance. The proposed provenance system will make use of cloud resources 
more attractive to researchers, since they will be able to verifiably track the use of their 
data. It will also help reviewers, funding agencies, and other scientists ensure the repro- 
ducibility of published scientific results. 

Figure 1 illustrates the information flow in the system, where cybersecurity inputs are
considered (rather than scientific workflow). Starting from the left, various cybersecurity
inputs are signed first by one user (user i), then by another (user j) 1, after which the
proposed system APIs take the signed input, use the modified Bitcoin block chain, and
produce entries in the distributed ledger. The provenance metadata can subsequently be
viewed using tools shown at the lower right (e.g., Komadu from Indiana University). 





Figure 1: The data provenance system integrates existing technologies using an innova- 
tive distributed ledger that guarantees security. 

The system comprises a computer and network security information provenance tool. 
This tool allows us to work with a variety of input data classes and a large volume of 
data. The system will be used by at least two professional information technology teams 
(Clemson and Auburn campus information technology professionals). Finally, using a 
security application, rather than a scientific workflow, allows us to consider a larger range 
of potential threats and data manipulation issues. Since all universities are under constant 
attack from external and internal threats, the system will also be faced with detecting 
subtle information manipulations in a way that may not be found in typical scientific 
workflows. 

1 User j could be the machine where the data is logged. For example, ssh assigns key pairs to both users
and machines. This approach would certify both the user entering provenance data and the physical device
used.

2 Related provenance work

Provenance is widely used in the scientific and business communities, for purposes such as
quality assessment of Web data [6, 7], cloud forensics [20, 22], and process documentation [10]. A
taxonomy of provenance applications is provided in [13].

The work presented in [7] looks at the quality of provenance data and, specifically,
data timeliness. Given its application to Web data, timeliness is a real issue. Since our
system is user-based and each user has a unique credential that is valid throughout the entire
service period, data timeliness will not be a problem. In [6], the same author extends the
work in [7] by presenting an approach for obtaining provenance information from
Resource Description Framework (RDF)-based metadata on the Web. The work reported
in [6] could be a useful reference if the input data contain Web data. According
to [17], there are two granularities of provenance: workflow provenance and data
provenance. Our data provenance prototype mainly processes workflow provenance, which is
the "entire history of the derivation of the final output of," while Web data belongs to the
latter, which strives to provide the derivation of single pieces of data.

Cloud computing is rapidly developing; however, the public nature of the cloud has
greatly broadened the attack surface. Digital forensics in the context of cloud computing
has therefore risen in prominence, and a new research topic, cloud forensics, has emerged.
A key task in cloud forensics is to prove the presence of a given piece of data. The authors
of [20] introduce the notion of a Proof of Past Data Possession (PPDP). They provide a
cryptographic scheme for creating PPDPs and test it on a commercial cloud vendor. The
proposed scheme is based on the Bloom filter, which is essentially a bitmap that records
a user's activities and is continually updated. Bit positions are obtained from the hashes of new
data instances and then used to update the Bloom filter. Since the Bloom filter is a probabilistic
data structure, the proposed scheme admits false positive errors. As we will use the
modified blockchain, which stores the complete information of all previous events, there
will be no such statistical errors. One question we will explore is whether a modification of
this approach can be integrated with Bitcoin block chains.
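
To make the false-positive behavior concrete, the following minimal Python sketch shows a
generic Bloom filter. It is not the PPDP scheme of [20]: the class name, the default sizes, and
the way bit positions are derived from SHA-256 are all illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter; an illustration only, not the scheme from [20]."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Derive several bit positions by hashing the item with a counter prefix.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # True means "possibly added" (false positives can occur);
        # False means "definitely never added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bf = BloomFilter()
bf.add(b"log-entry-1")
print(bf.might_contain(b"log-entry-1"))
```

An item that was added is always reported as possibly present. An item that was never added
is usually rejected, but as more bits are set the chance of a false positive grows, which is the
statistical error discussed above.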

Because of the vast quantity and complexity of data, extraction of provenance informa- 
tion from metadata has become an important research topic. In [14] Simmhan and others 
present a framework for collecting uniform and usable provenance metadata indepen- 
dent of the workflow implementation while minimizing modification and performance 
overhead. According to [14], provenance of a single workflow is a set of discrete activ- 
ities. Since workflow execution occurs at three levels: workflow level, service level and 
application level, each activity is fully described by four attributes: time, location, execu- 
tion level, and data products. With these definitions, they construct a provenance graph 
using stored activities. The algorithm is similar to graph inference from a set of node-edge
pairs. Simmhan's work is part of the Karma Provenance Service in the domain of Linked 
Environments for Atmospheric Discovery (LEAD). Since we will integrate Karma's func- 
tionality into our system, we describe Karma separately. 

Other efforts focus on tracking and visualization of provenance metadata (i.e., dependencies
between data products). One recently published work [8] extends the open provenance
model (OPM), a graphical model for provenance, by clarifying the distinction
between precise and imprecise edges in the directed graph. The authors propose formal
semantics for the improved OPM, which provide a criterion for defining the dependencies
between instances. Complete OPM inference rules are provided accordingly.

This project concentrates on securely maintaining provenance metadata. For that reason,
we have decided to integrate our distributed ledger with the Indiana University
Karma/Komadu tools. In this way, we can leverage existing results without duplication
of effort. Our approach will allow strong security guarantees to be grafted onto the exist- 
ing infrastructure. 


3 Security Provenance Approach 

In this section, we enumerate research questions that motivate this effort, describe our
overall approach, explain how we integrate with third-party tools, and detail how we
leverage existing Bitcoin infrastructure.

3.1 Key Research Questions to be Answered 

The following research questions are the central research concerns of this work: 

1. Can advances in cryptocurrency be adapted to more general security applications? 
How addressed: Proof by construction of the prototype. 

2. Is the Proof of Past Data Possession approach, the Permacoin approach, or a hybrid
thereof the best approach to securely maintaining data provenance? How addressed: Our system
allows us to experiment with both, evaluate them, and consider hybrids. Answers to this
question will include analysis and experimental results using our prototype.

3. For opt-in provenance (such as scientific workflow systems), is the prototype system 
an effective means to generate and maintain provenance data? How addressed: We 
will show conceptually how to generate provenance from a scientific workflow as 
one of our activities. 

4. For opt-in provenance (such as scientific workflow systems), what is the efficacy of 
the prototype for understanding provenance after the fact? How addressed: Our 
system interfaces with the Komadu system, which will allow after-the-fact visual- 
ization of provenance data. We will also provide our tool to professional security 
personnel, who will provide feedback. They will also apply the tool to the discovery of
data manipulation.

3.2 Architectural Overview 

The system, illustrated in Figure 2, securely maintains scientific workflow and/or cy- 
bersecurity provenance information. We are creating a system that maintains inviolable 
provenance ledgers of scientific data and security information. The system consists of:

• An API for integrating, storing, and retrieving data inputs, 

• Cryptographic signatures that assure ledger entry origin and integrity.

Figure 2: Data provenance system integrates existing technologies using an innovative
distributed ledger that guarantees security. 

• A distributed ledger, building on Bitcoin's code base 2 , that efficiently maintains a 
consistent set of provenance records. It uses a modified version of Bitcoin's block 
chain approach to maintain a secure ledger, but with less computational overhead. 

• An API for: 

- Reading ledger entries, 

- Making new ledger entries, 

- Reading data sources referred to in the ledger entries, and 

- Verifying the security and integrity of data and provenance metadata. 

• Provenance visualization outputs, built by leveraging code from Indiana Univer- 
sity's Karma 3 and Komadu 4 tools. 

The system architecture, see Figure 2, has three main components: 

1. System inputs use an API to package data inputs for data provenance registration. 
Each provenance data entry contains at least: (i) data origin URI, (ii) data storage 
URI, (iii) SHA-256 (or SHA-512) hash of data file contents, (iv) time stamp, (v) pub- 
lic key signature from the responsible user, (vi) public key signature from the data 
origin machine, and (vii) public key signature from the data storage machine. 5 Data 

2 https://bitcoin.org/en/download (last visited 05/2015)

3 http://d2i.indiana.edu/provenance_karma (last visited 05/2015)

4 http://d2i.indiana.edu/provenance_komadu (last visited 05/2015)

5 We support both secure shell (ssh) and transport layer security (TLS) public key infrastructure for public 
key signing operations. We will not use the current certificate authority hierarchy for accepting certificates. 








processing workflow entries will have this information for each data input, pro- 
grams used, and each data output. 

2. Provenance entries are stored in a distributed ledger forked from the Bitcoin code 
base [12]. Bitcoin uses a peer-to-peer network of ledgers linked by a series of time 
stamped cryptographic hashes to provide absolute certainty that currency transfers 
are not fraudulent. We will adapt their secure hashing approach to be certain of data 
integrity. We will also use their ledger synchronization code, which has been shown 
to be efficient and reliable. However, Bitcoin has constraints that we do not need to 
enforce, notably: avoiding currency being spent twice and limiting the total num- 
ber of Bitcoins. Those Bitcoin constraints are maintained through computationally 
intensive processing (proof of work). Our constraints, mainly data integrity, can be 
enforced with much less computation. 6 

3. Provenance data exploration will be provided by leveraging existing provenance 
toolkits, primarily the Karma and Komadu tools from Indiana University 7 . We need 
to modify their data retrieval code to interface with our ledger system and automat- 
ically to perform integrity verification. 
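
As an illustration of component 1, the sketch below packages the fields (i) through (vii) of a
provenance entry. The class, field, and function names are hypothetical and are not the
prototype's API. The three signatures are left as opaque placeholder strings, since the
signing step depends on the ssh or TLS key infrastructure in use.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass
class ProvenanceEntry:
    # Hypothetical field names mirroring items (i)-(vii) in the text.
    origin_uri: str
    storage_uri: str
    content_sha256: str
    timestamp: str
    user_signature: str = ""              # placeholder, filled in by the user
    origin_machine_signature: str = ""    # placeholder, filled in by the origin host
    storage_machine_signature: str = ""   # placeholder, filled in by the storage host

    def signing_payload(self) -> bytes:
        """Canonical bytes over fields (i)-(iv) that each party would sign."""
        unsigned = {
            k: v for k, v in asdict(self).items() if not k.endswith("_signature")
        }
        return json.dumps(unsigned, sort_keys=True, separators=(",", ":")).encode()


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def make_entry(data: bytes, origin_uri: str, storage_uri: str, timestamp: str):
    return ProvenanceEntry(origin_uri, storage_uri, sha256_hex(data), timestamp)


def matches(entry: ProvenanceEntry, data: bytes) -> bool:
    """Integrity check: does the file content still match the recorded hash?"""
    return entry.content_sha256 == sha256_hex(data)


entry = make_entry(b"abc", "file://host-a/data.log", "file://host-b/data.log",
                   "2015-05-01T00:00:00Z")
print(matches(entry, b"abc"), matches(entry, b"abd"))
```

Serializing the unsigned fields with sorted keys gives every signer the same byte string, so
the user and both machines can sign independently and a verifier can recompute the payload.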

3.3 System Inputs 

The prototype supports both data security and science workflow applications. The APIs 
can easily be adapted to either application. This is advantageous, since a system with 
more users and ledgers will have stronger security guarantees. 

The system will also support data security applications. This choice allows the system 
to process large volumes of heterogeneous data during testing phases. This approach 
also enables us to work closely with the Clemson and Auburn data security teams during 
project testing. This section outlines a minimal set of data sources that are maintained 
during the development. Other data sources will be solicited from the data security pro- 
fessionals using the system. 

Maintaining consistent logs of security events, that include audit information, is chal- 
lenging as an initial application. One of these challenges is the fact that intruders try 
to forge and modify system security information as a matter of course. This gives the 
research team a ready set of fraudulent entries to be discovered. It also provides an in- 
centive for the security team partners working at Clemson and Auburn actively to use 
and critique the tools they receive. We posit that active fraud occurs more often in sys- 
tem intrusions than in peer-reviewed science, making this a better domain for security 
verification than scientific-workflow. 

Enormous volumes of security data are generated and collected every day on the 
Clemson and Auburn campuses. Since Clemson is both a research university and the 

Users will need to manually register the set of certificates that they trust. OpenSSH is released under a
minimally restrictive BSD license. OpenSSL is released under an open source license that mainly restricts
use of the name OpenSSL.

6 Bitcoin is released under the MIT license which would permit the modification and use as foreseen by 
this project. 

7 Karma and Komadu are both released under the Apache license, which is largely consistent with the
BSD and MIT licenses of the other components we are integrating into this system.

cloud computing provider for the State of South Carolina's healthcare services, the Clemson
CCIT partner for this work has to use security logs for multiple purposes: as inputs
for security research, for intrusion detection, and for documenting legal compliance.
Managing security data is challenging:

• Security data comes from a variety of sources, including CCTV surveillance systems,
system and application loggers, monitoring and auditing tools, penetration
test tools, etc.

• The system needs to treat multiple classes of syntax (network traffic, system logs, 
videos etc.) and semantics (CPU usage, memory usage, traffic details, network 
statistics, vulnerability report, wellness report, etc.). 

• Different security data classes have different storage and archival needs. They have 
varying work-flow, distribution, and protection requirements. 

Data inputs to the system prototype will include, but are not limited to, Nagios data, system
logs, nmap scan results, Intrusion Detection System (IDS) data, operator text comments,
network flow information, and processed data. The IT professional security staffs
at Auburn and Clemson, who will use the system during beta and post-beta testing, will
be asked for additional data inputs that they would want to track. The APIs being devel- 
oped will support the introduction of new data types with minimal work. 

A. System Logs 

System logs, such as Windows Event Logs or Linux system logs, are text files recording 
important events (such as user log-ons, program errors, driver updates, system crashes, 
etc.) that occur in a computer system. Events to be recorded are pre-defined by the op- 
erating system. System logs are vital for computer system management and security. 
They provide an audit trail to diagnose the problems during the execution of an oper- 
ating system. They provide valuable information that allows identification of anomalies 
and security breaches. Attackers, after successfully intruding into a system, often modify
system logs to hide their traces. Security measures must be in place to prevent unauthorized
alteration of system logs and to detect alterations when they occur.

B. Other Logs 

Many applications and servers create logs recording events such as user access, user op- 
erations, application errors, crashes, software updates, etc. Like system logs, application 
logs are useful not only for diagnosing application problems, but also detecting security 
breaches such as application misuse, unauthorized data access, unauthorized data mod- 
ifications, etc. Access and operation authorizations for logs must be carefully set and 
enforced. 

C. Nagios Data 

Nagios 8 is widely used open-source software for monitoring IT infrastructure. Nagios
can monitor hosts (Windows machines, Linux machines, network servers, network
printers, etc.), network elements (switches and routers), and services (HTTP, FTP, SSH, 
etc.). Nagios records the status of the objects being monitored in a text status file. For 

8 https://www.nagios.org/ Nagios is open source with a license that allows use in an application such
as this, but does not allow forking of the Nagios software. To comply with this, the project will read and
process Nagios outputs without modifying Nagios source code.

# NAGIOS status file
#
# THIS FILE IS AUTOMATICALLY GENERATED
# BY NAGIOS. DO NOT MODIFY THIS FILE!

info {
    created=1233491098
    version=2.11
    }

program {
    modified_host_attributes=0
    modified_service_attributes=0
    nagios_pid=15018
    daemon_mode=1
    program_start=1233490393
    last_command_check=0
    last_log_rotation=0
    enable_notifications=1
    active_service_checks_enabled=1
    passive_service_checks_enabled=1
    active_host_checks_enabled=1
    passive_host_checks_enabled=1
    enable_event_handlers=1
    obsess_over_services=0



Figure 3: Nagios status file. 


Figure 4: Nagios report. 


hosts, the status can be CPU load, memory usage, users logged in, running processes,
online/offline state, etc. For network elements, the status can be packet loss, traffic rate,
etc. For services, the status can be protocol version, service reachability, response
time, etc. The Nagios status file is critical for determining the health and security of
monitored objects; therefore, it must be protected from any modification. Nagios also parses
the status file and generates status reports in HTML. The reports present an easy-to-read
status history over a user-defined period of time. These reports must also be protected from
change. Figure 3 shows an example status file, and Figure 4 shows an example report.
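
As a sketch of how such a status file could be read without modifying Nagios itself, the
following Python fragment parses the "name { key=value }" blocks of the status file format and
computes a SHA-256 digest of the raw text that could be recorded in a provenance entry. The
function name and the idea of hashing the whole file are assumptions made for illustration.

```python
import hashlib
import re

SAMPLE_STATUS = """\
# NAGIOS status file
info {
    created=1233491098
    version=2.11
    }

program {
    daemon_mode=1
    enable_notifications=1
    }
"""


def parse_nagios_status(text: str) -> dict:
    """Parse 'name { key=value ... }' blocks into {name: [dict, ...]}."""
    blocks = {}
    current_name, current = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        opener = re.fullmatch(r"(\w+)\s*\{", line)
        if opener:
            current_name, current = opener.group(1), {}
        elif line == "}" and current is not None:
            blocks.setdefault(current_name, []).append(current)
            current_name, current = None, None
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return blocks


status = parse_nagios_status(SAMPLE_STATUS)
digest = hashlib.sha256(SAMPLE_STATUS.encode("utf-8")).hexdigest()
print(status["info"][0]["version"], digest[:16])
```

Any later change to the status file, however small, yields a different digest, so comparing
the stored digest with a freshly computed one reveals modification.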

D. Nmap Data 

Nmap 9 is an open-source security scanner used to discover hosts and services on a network.
Nmap sends specially crafted packets to target networks and hosts, and then analyzes
the responses from the targets to generate security-critical information such as hosts
connected to a network, open ports, running application protocols and versions, operating
system and hardware characteristics of the hosts, and more. Attackers use Nmap
results to map the topology of a network and identify potential targets, while
network and system administrators use Nmap results to find vulnerable points that must
be fixed. Figure 5 shows the results of an example Nmap scan. Nmap allows users to
save scan results in text files in various formats, including XML.
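
Because Nmap can save results as XML, a data input adapter can extract the security-relevant
fields with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a
small hand-written sample that follows the general shape of Nmap's XML output (host, address,
ports, port, state elements). The sample data and the function name are illustrative
assumptions, not output from a real scan.

```python
import xml.etree.ElementTree as ET

SAMPLE_SCAN = """\
<nmaprun scanner="nmap">
  <host>
    <address addr="192.0.2.10" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="22">
        <state state="open"/><service name="ssh"/>
      </port>
      <port protocol="tcp" portid="23">
        <state state="closed"/><service name="telnet"/>
      </port>
    </ports>
  </host>
</nmaprun>
"""


def open_ports(xml_text: str):
    """Return (address, protocol, port) tuples for every port reported open."""
    root = ET.fromstring(xml_text)
    found = []
    for host in root.iter("host"):
        address = host.find("address")
        addr = address.get("addr") if address is not None else "unknown"
        for port in host.iter("port"):
            state = port.find("state")
            if state is not None and state.get("state") == "open":
                found.append((addr, port.get("protocol"), int(port.get("portid"))))
    return found


print(open_ports(SAMPLE_SCAN))
```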

E. IDS Data 

An IDS monitors network or system activities, identifies possible security violations,
logs information about them, and reports them to security administrators.
The log files that an IDS generates can be massive, depending on the
volume of traffic and information it handles. They provide valuable support for diagnosing
and reviewing security problems.

F. Other Inputs 

In addition to the data listed above, other massive security data from security tools (such 
as Backtrack, Metasploit, Wireshark, Nexus, etc.) will be interfaced to the prototype sys- 

9 https://nmap.org/

Figure 5: Nmap scan results.


tem. Clemson University is a pioneer in the use of software defined networking (SDN). 
An interface will be crafted that allows SDN controllers to log information in the prove- 
nance system. The API will also support logging events where programs process logs 
and produce derivative data. 

3.4 Distributed ledger 

The data input API extracts provenance information from the data inputs in Section 3.3
and stores it in a distributed ledger system. This ledger is being constructed by
modifying the Bitcoin code base. This will provide security guarantees for the provenance
data. The Bitcoin cryptocurrency maintains its value by using peer-to-peer distributed 
ledger software to track the ownership of every Bitcoin. The security of this tracking is 
guaranteed by maintaining a cryptographically signed chain of secure hash values. Since 
this chain is stored at multiple sites and is also being continuously updated it will be func- 
tionally impossible for this chain to be manipulated by fraudsters. The fact that Bitcoins 
have been able to retain their value despite clear monetary incentive to violate the security 
of the system is strong evidence of the practical validity of this security approach.
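
The core idea of a chain of secure hash values can be sketched in a few lines of Python. This
is a simplified, single-node illustration with hypothetical function names; it omits
signatures, peer-to-peer replication, and everything else in the Bitcoin code base. Each
entry stores the hash of its predecessor, so altering any earlier entry invalidates every
later link.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # conventional placeholder for "no predecessor"


def entry_hash(prev_hash: str, payload: dict) -> str:
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append(
        {"prev": prev_hash, "payload": payload, "hash": entry_hash(prev_hash, payload)}
    )


def verify(chain: list) -> bool:
    """Recompute every link; any edited entry breaks the chain from that point on."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(entry["prev"], entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


ledger = []
append(ledger, {"event": "login", "user": "alice"})
append(ledger, {"event": "file_read", "user": "alice"})
print(verify(ledger))

ledger[0]["payload"]["user"] = "mallory"  # simulate tampering with an old entry
print(verify(ledger))
```

When copies of such a chain are held at multiple sites, an attacker would have to rewrite
the altered entry and all of its successors on every copy to avoid detection.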

We are currently modifying the Bitcoin code base to provide the same security guarantees
for provenance data. This section discusses Bitcoin's approach to security and
how that infrastructure can be modified to provide security guarantees for
provenance metadata. This is a novel approach. It is challenging to modify the Bitcoin
approach in a way that maintains security without requiring an exorbitant amount of
computation. However, the success of Bitcoin shows that using its ideas is a low-risk
way of efficiently providing effective security guarantees.

The Bitcoin block chain is a distributed public ledger in which every Bitcoin transaction
is registered [12]. Block chains allow value to be exchanged between two parties
without the need for any intermediary. Bitcoin ledgers are a new distributed harmonic
system that allows transactions, or other data, to be securely stored and verified without
a centralized authority. Instead, Bitcoin transactions are validated by the entire network.
A working model of the Bitcoin system is shown in Figure 6. Those transactions do not
necessarily have to be financial, and the data can be of any form or format [11].

A key property of block chains is that the present state can be computed by anyone 
using the protocol, the genesis block, and a particular history [5]. Simply applying each
transaction in the history to the genesis block yields a final result that should be the
current state. The problem is to find a consistent history. The solution provided by Bitcoin
states that the right history is the one with the most proof of work. So as far as the network
is concerned, the correct chain is the longest valid chain available on the network. This
implies that the Bitcoin block chain is an eventually consistent protocol, in which all
nodes converge on the same ledger over time through the exchange of block chain packets.

Computations that define Bitcoin are not restricted to currency; almost any data can 
be managed using the block chain. This removes the need for trusted third parties. One 
benefit is that data can be definitively verified and time-stamped. Given this ability, a dis- 
tributed ledger can ensure the integrity of data provenance, creating trust and reestablish- 
ing the notion of originality of key documents and artifacts. Since our goal is provenance, 
being able to assure originality is key. 

However, distributed ledgers and cryptocurrency systems are fundamentally differ- 
ent [16]. The key difference is how transactions are validated. Bitcoin uses pseudony- 
mous and anonymous nodes to validate transactions whereas distributed ledgers require 
legal identities (that is, permissioned nodes) to corroborate transactions. As a result, dis- 
tributed ledgers are able legally to host off-chain assets because of their authenticated 
approach to validation. Bitcoin and other permissionless systems cannot do so. 

A distributed ledger system usually utilizes cryptocurrency-inspired technology to
verify or store votes (e.g., hashes). While some platforms use tokens, they are utilized
more as receipts than as commodities or currencies. The Bitcoin block
chain is commonly characterized as a distributed ledger.

Figure 6: The working model of Bitcoin, adapted from [12].

A Bitcoin address is a cryptographic hash of the public key. It is calculated
as follows:

Version = 1 byte of 0 (zero); on the test network, this is 1 byte of 111
Key hash = Version concatenated with RIPEMD-160(SHA-256(public key))
Checksum = first 4 bytes of SHA-256(SHA-256(Key hash))
Bitcoin Address = Base58Encode(Key hash concatenated with Checksum)
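
The checksum and encoding steps above can be sketched in Python using only the standard
library. The function names are hypothetical. Because RIPEMD-160 is not available through
hashlib in every Python build, this sketch takes the 20-byte RIPEMD-160(SHA-256(public key))
value as an input rather than computing it.

```python
import hashlib

# Base58 alphabet used by Bitcoin: digits and letters without 0, O, I, and l.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(data: bytes) -> str:
    number = int.from_bytes(data, "big")
    encoded = ""
    while number > 0:
        number, remainder = divmod(number, 58)
        encoded = ALPHABET[remainder] + encoded
    # Each leading zero byte is represented by the character '1'.
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + encoded


def base58_decode(text: str) -> bytes:
    number = 0
    for ch in text:
        number = number * 58 + ALPHABET.index(ch)
    leading_ones = len(text) - len(text.lstrip("1"))
    body = number.to_bytes((number.bit_length() + 7) // 8, "big")
    return b"\x00" * leading_ones + body


def double_sha256(data: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def address_from_hash160(hash160: bytes, version: int = 0) -> str:
    key_hash = bytes([version]) + hash160      # Version || RIPEMD-160(SHA-256(key))
    checksum = double_sha256(key_hash)[:4]     # first 4 bytes of the double hash
    return base58_encode(key_hash + checksum)


def checksum_ok(address: str) -> bool:
    raw = base58_decode(address)
    return double_sha256(raw[:-4])[:4] == raw[-4:]


example_hash160 = bytes(range(20))  # stand-in value, not a real key hash
address = address_from_hash160(example_hash160)
print(address, checksum_ok(address))
```

The checksum lets software reject most mistyped addresses before any funds are sent.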

Bitcoin uses the Elliptic Curve Digital Signature Algorithm (ECDSA) to sign transactions [3].
Transactions are cryptographically signed records that reassign ownership of Bitcoins to new
addresses. Transactions have inputs, which are records that reference the funds from previous
transactions, and outputs, which are records that determine the new owner of the transferred
Bitcoins and which will be referenced as inputs in future transactions when those funds are
re-spent. The contents of a block and a transaction are shown in Figure 7. Each input must have a
cryptographic digital signature that unlocks the funds from the prior transaction. Only the person
possessing the appropriate private key is able to create a satisfactory signature; this in effect
ensures that funds can only be spent by their owners. Each output determines which Bitcoin address
is the recipient of the funds.

Traditionally, one way to undermine a peer-to-peer network is by creating large numbers of
pseudonymous identities in order to gain a disproportionate degree of influence (or votes). This
is called a Sybil attack [4]. Bitcoin was purposefully designed to make it expensive to attack the
network [12]. It establishes an ordering for transactions through a 'proof-of-work' process, which
is an incentive-based scheme for countering the Sybil attack. This approach makes it costly for
an attacker to impose a new transaction ordering that could allow them to double-spend
transactions [1].

As an example, suppose that Alice wants to double spend her bitcoin. She uses an automated
system to create numerous distinct identities on the network. She tries to spend the same
coin with both Bob and Charlie. When Bob and Charlie ask the network to validate their
respective transactions, Alice's manipulated identities flood the network, announcing to each of
them that his transaction has been validated, possibly fooling one or both into accepting the
transaction.

Figure 7: The contents of a block and a transaction

The solution provided by Nakamoto combines two elements: (1) making validation
of a transaction computationally expensive, and (2) giving validators incentives for validating
transactions. The benefit of making it costly to validate transactions is that validation can no longer be
influenced by the number of false network identities someone controls, because they would have
to incur a huge computational cost in order to do so, rendering this strategy impractical. This
solution is known as proof of work [2], as noted above. Proof of work uses the SHA-256 hash function
to prevent double spending. To validate a block, a user (miner) must find an input whose 256-bit
hash value has a predetermined characteristic. Specifically, let s
be a base string, and h be the SHA-256 hash function. The miner has to find a string n, called the
nonce, for which h(s || n) = c, where || denotes concatenation and c is an output that begins
with a required, difficulty-dependent number of zeroes.
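
The nonce search can be illustrated with a toy Python example. This is a deliberate
simplification: it applies a single SHA-256 to the base string followed by a decimal nonce and
counts leading zero hexadecimal digits, whereas Bitcoin hashes a binary block header twice and
compares the result against a numeric target. The function name and parameters are
hypothetical.

```python
import hashlib


def find_nonce(base: bytes, zero_hex_digits: int, max_tries: int = 2_000_000):
    """Search for a nonce n such that SHA-256(base || n) starts with the given zeroes."""
    prefix = "0" * zero_hex_digits
    for nonce in range(max_tries):
        digest = hashlib.sha256(base + str(nonce).encode("ascii")).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest
    return None  # no solution found within the try budget


result = find_nonce(b"block-header-placeholder", zero_hex_digits=4)
print(result)
```

Each additional required zero hexadecimal digit multiplies the expected number of trials by
16, while checking a claimed nonce takes a single hash. This asymmetry between finding and
verifying a solution is what makes the scheme usable.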





Figure 8: The working scheme of the block chain in Permacoin. 

However, Bitcoin's computationally expensive 'mining' is in fact not required, should the sys- 
tem have other security requirements. One modification for a less computationally expensive 
approach is given in [9], where the Bitcoin block chains are modified to support distributed stor- 
age of archival data. This scheme is called Permacoin. Unlike Bitcoin, Permacoin clients use a 
computationally inexpensive scratch-off puzzle (SOP) based on Proofs-of-Retrievability (PORs) 
instead of Bitcoin's Proof-of-Work. This modification facilitates highly decentralized file storage 
and provides security guarantees similar to Bitcoin, but without the high computational overhead. 
Figure 8 describes the block chain architecture in Permacoin.

To avoid Bitcoin's overhead, we are using either Permacoin's proof approach [9] or proof of
possession [21], instead of proof of work, to prevent the insertion of incorrect block chain
information. This will allow the system to provide the same security guarantees as Bitcoin
without requiring an exorbitant volume of computation. We are also studying the use of a hybrid
of both in our system. That is, we are evaluating the comparative strengths, weaknesses,
performance, and overheads of both approaches.
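
To illustrate the general idea behind possession-style proofs, the following Python sketch
shows a simple challenge-response over a Merkle tree of file segments. It is an assumption-laden
illustration only: it is neither Permacoin's scratch-off puzzle [9] nor the proof of possession
scheme of [21], and all function names are hypothetical. The verifier keeps only the root hash,
challenges the prover for a segment by index, and checks the returned segment against the root.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _padded(level):
    """Duplicate the last node when a level has an odd number of nodes."""
    return level + [level[-1]] if len(level) % 2 else level


def build_tree(segments):
    """Return all tree levels, leaves first; the last level holds the root."""
    level = [_h(seg) for seg in segments]
    levels = [level]
    while len(level) > 1:
        level = _padded(level)
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def prove(segments, levels, index):
    """Prover: answer a challenge with the segment and its sibling hashes."""
    segment, path = segments[index], []
    for level in levels[:-1]:
        path.append(_padded(level)[index ^ 1])  # sibling of the current node
        index //= 2
    return segment, path


def verify(root, index, segment, path):
    """Verifier: recompute the root from the segment and the sibling path."""
    node = _h(segment)
    for sibling in path:
        node = _h(node + sibling) if index % 2 == 0 else _h(sibling + node)
        index //= 2
    return node == root


if __name__ == "__main__":
    segments = [f"segment-{i}".encode() for i in range(5)]  # hypothetical file
    levels = build_tree(segments)
    root = levels[-1][0]
    seg, path = prove(segments, levels, 3)
    print("proof accepted:", verify(root, 3, seg, path))
```

A prover that no longer holds the challenged segment cannot produce a response that hashes to
the stored root, and each check costs only a logarithmic number of hash evaluations.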

3.5 Tools 

Provenance is vital to the sharing and reuse of digital scientific data. Provenance provides
the information needed to establish the authenticity of a particular data set and to evaluate
its quality and history. Provenance collection can be considered to be part of a
cyber-infrastructure system, but it is also viable as a standalone tool. The Karma provenance
tool is a standalone tool that can be added to existing cyber systems for both the collection
and the representation of provenance data [18]. Karma was developed by Prof. Beth Plale at
Indiana University. Karma is designed around a modular architecture that supports multiple
instrumentation plug-ins, which makes it flexible enough to support various architectural
settings. It provides the downstream provenance of a given data object. The resulting
provenance trace has the input data object as the source for all other elements in the trace.

Karma facilitates visualization of provenance data, including support for manipulating large
structures and for interactivity. This can help users navigate their experimental information
with a mental map of what is going on in the experiment. The Karma provenance framework records
uniform and usable provenance metadata for scientific workflows.

This tool collects two forms of provenance. The first is process provenance, which describes
the workflow's execution and is used to monitor workflow progress and to mine it for results
validation [15]. The second is data provenance, which provides complementary metadata about
the derivation history of data products in the workflow, including the services that create and
use them and the input data transformed to generate them. Data provenance forms the basis for
quality-oriented data product discovery.

Komadu is another standalone provenance collection tool, also developed by Indiana University
[19]. Komadu provides a Web Services API and a Messaging API for both provenance collection and
querying of collected data. Provenance collection is driven by notifications, each of which
represents a particular event related to a certain activity, entity, or agent. The query API is
mainly used to find details about a particular activity, entity, or agent and to generate the
provenance graph for it. Unlike Karma, Komadu is not tightly coupled either to workflows or to
scientific provenance collection. The Komadu API uses generic terms and operations so that any
type of provenance can be captured and queried.

Since our goal is to ensure data provenance using the Bitcoin block chain, the distributed
ledger can provide a validated workflow to be introduced into Karma and Komadu. These two
tools are directly relevant to our effort. We will feed data from the ledger directly into Karma
and/or Komadu. If that is not possible, we will create code to extract and validate the data,
then store it in a database, which, in turn, can be input to Karma and/or Komadu.
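
The fallback path of extracting, validating, and storing ledger data could look roughly like
the Python sketch below. Everything here is an assumption made for illustration: the record
layout (`prev_hash`, `hash`, `payload`), the hash-linking rule, the table name, and the function
names are hypothetical and do not reflect the prototype's ledger format or any Karma or Komadu
interface. The sketch checks that each record links to its predecessor before writing the
records to an SQLite database.

```python
import hashlib
import json
import sqlite3

GENESIS = "0" * 64  # assumed placeholder for the first record's predecessor


def record_hash(prev_hash: str, payload: dict) -> str:
    """Hash a record's payload together with the previous record's hash."""
    body = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + body).hexdigest()


def append_record(records: list, payload: dict) -> None:
    """Build a correctly linked record (used here only to create sample data)."""
    prev = records[-1]["hash"] if records else GENESIS
    records.append({"prev_hash": prev, "payload": payload,
                    "hash": record_hash(prev, payload)})


def validate_chain(records: list) -> bool:
    """Check that every record's hash and back-link are consistent."""
    prev = GENESIS
    for rec in records:
        if rec["prev_hash"] != prev:
            return False
        if rec["hash"] != record_hash(prev, rec["payload"]):
            return False
        prev = rec["hash"]
    return True


def store_records(records: list, conn: sqlite3.Connection) -> int:
    """Store records only if the whole chain validates; return the row count."""
    if not validate_chain(records):
        raise ValueError("ledger records failed validation")
    conn.execute("CREATE TABLE IF NOT EXISTS provenance "
                 "(hash TEXT PRIMARY KEY, prev_hash TEXT, payload TEXT)")
    conn.executemany(
        "INSERT INTO provenance VALUES (?, ?, ?)",
        [(r["hash"], r["prev_hash"], json.dumps(r["payload"], sort_keys=True))
         for r in records])
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM provenance").fetchone()[0]


if __name__ == "__main__":
    ledger = []
    append_record(ledger, {"entity": "dataset-1", "activity": "created"})
    append_record(ledger, {"entity": "dataset-1", "activity": "transferred"})
    print("rows stored:", store_records(ledger, sqlite3.connect(":memory:")))
```

Validating before storing means that a downstream tool reading the database only ever sees
records whose links have been checked.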


4 Current Status 

Currently, we have designed three different use cases: data forensics (see Figure 11), academic
integrity (see Figure 10), and maintaining security logs (see Figure 12). In each case, everyone
who comes in contact with the data (e.g., hard disk, experiment results, or security logs) is
reflected in the provenance. It is important that privacy and trust are not violated by any of
the parties.

We have designed APIs for these use cases and created an API architecture for the prototype (see
Figure 9), and we are currently integrating the different APIs necessary for the alpha version
release. We have also initiated a threat model analysis for Bitcoin mining and surveyed
crypto-currency mining and provenance systems.


5 Conclusion 

In this technical report, we give detailed information about a system developed by Auburn
University and Clemson University. The system leverages crypto-currency technology to secure
data provenance outside of users' control. The system makes the use of cloud resources more
attractive to users, since they will be able to verifiably track the use of their data. It will
also help funding agencies and other stakeholders ensure the reproducibility of scientific
results. The system allows data provenance to be applied to other systems and data activities
besides workflows and the cloud.


Figure 9: Current API Architecture.


Figure 10: Academic Integrity Use Case.

Figure 11: Data Forensics Use Case.


Figure 12: Security Logs Use Case.